In this lab you will learn
Data wrangling in R: how to download a data set from GitHub and process it for viewing and analysis.
In general, we make statistical inferences by comparing our data against a null model.
In a spatial context, we can test data for nonrandom processes by comparing against a Poisson distribution.
We can use quadrat-based analysis or spatial statistics like Ripley’s K.
The spatial scale of observation is critical for identifying nonrandom processes.
Species that have similar preferences for local environments will tend to be found near each other.
library(tidyverse)
library(spatstat) ## for Ripley's K analysis
library(igraph) ## for community-finding analysis
library(ggplotify) ## for function as.ggplot()
library(wordspace) ## for function dist.matrix()
theme_set(theme_bw())
theme_update(panel.grid = element_blank())
All populations exist in space. This may seem like a trivial observation, but it has profound implications for population dynamics. This is true of all life, and particularly so for sessile (i.e. stationary) organisms, such as plants. The spatial location of a plant is crucial for its ability to survive and reproduce, as it will govern its interactions with the environment. Conversely, a plant’s ability to survive in different environments will determine where it can recruit.
The dispersal stage is arguably one of the most consequential life stages of plants. Dispersal, like most spatial processes, is distance-based: seeds are more likely to disperse to some distances from the parent tree than others. This non-randomness in the movement of plants is likely to create non-random emergent patterns at the population level, and ultimately at the community (i.e. multispecies) level.
In this lab, we will examine the spatial distribution of trees in a tropical rainforest. We will look for signs of non-random ecological processes, such as distance-based dispersal and local environmental filters, by comparing the spatial distribution of our species against a null model where trees are randomly located. Our study site is Barro Colorado Island, in the middle of the Panama canal.
The 50-hectare plot at Barro Colorado Island, Panama, is a 1000 meter by 500 meter rectangle of forest inside of which all woody trees and shrubs with a stem diameter of at least 1 cm have been censused. Every individual tree in the 50 hectares was permanently numbered with an aluminum tag in 1982, and every individual has been revisited six times since (in 1985, 1990, 1995, 2000, 2005, and 2010). In each census, every tree was measured, mapped, and identified to species. Details of the census method are presented here, and a detailed description of the seven-census results is given here.
Figure: Barro Colorado Island, Panama. Photo credit: Christian Ziegler.
First, we need to load the data. We will pull it from my GitHub BCI repository, where I collected data files taken from the Smithsonian Tropical Research Institute (STRI) public BCI data repository.
Run the following command (it will take a second because it is a large dataset):
bci =
get(
load(
url(
'https://github.com/rafaeldandrea/BCI/blob/master/bci.full7.rdata?raw=true'
)
)
) |>
as_tibble()
Now let’s look at the data:
bci
## # A tibble: 394,658 × 20
## treeID stemID tag StemTag sp quadrat gx gy Measu…¹ Censu…² dbh
## <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <int> <dbl>
## 1 1 NA -05599 <NA> swar… "4007" 800. 152. NA NA NA
## 2 2 NA -22851 <NA> hyba… "0718" 152. 379. NA NA NA
## 3 3 NA -24362 <NA> aegi… "0417" 95.2 358. NA NA NA
## 4 4 NA -26589 <NA> beil… "0007" 11.7 151. NA NA NA
## 5 5 NA -26590 <NA> fara… "0004" 7.70 96.2 NA NA NA
## 6 6 NA -26703 <NA> hyba… "0210" 50.1 215. NA NA NA
## 7 7 NA -26746 <NA> tet2… "0218" 58.5 365. NA NA NA
## 8 8 NA -27125 <NA> des2… "0613" 140. 267. NA NA NA
## 9 9 NA -27298 <NA> crot… "0704" 158. 86.4 NA NA NA
## 10 10 NA -27784 <NA> alse… "" NA NA NA NA NA
## # … with 394,648 more rows, 9 more variables: pom <chr>, hom <dbl>,
## # ExactDate <chr>, DFstatus <chr>, codes <chr>, nostems <dbl>, date <dbl>,
## # status <chr>, agb <dbl[1d]>, and abbreviated variable names ¹MeasureID,
## # ²CensusID
This is a large dataset cataloguing almost 395 thousand trees within the 1000 m \(\times\) 500 m (50-hectare) Forest Dynamics Plot at Barro Colorado Island, Panama.
Notice the data is essentially a table where each row refers to a
tree and each column describes one of the properties of the tree, such
as its physical location, size, and species. For our purposes, we are
interested in the geographic location, size, and species of each tree.
So we will select only those relevant characteristics from the larger
dataset by piping the verb select() to the bci
dataset and specifying the desired variables:
bci =
bci |>
select(quadrat, gx, gy, sp, dbh)
If we now visualize our data, we see only the five relevant columns:
bci
## # A tibble: 394,658 × 5
## quadrat gx gy sp dbh
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 "4007" 800. 152. swars1 NA
## 2 "0718" 152. 379. hybapr NA
## 3 "0417" 95.2 358. aegipa NA
## 4 "0007" 11.7 151. beilpe NA
## 5 "0004" 7.70 96.2 faraoc NA
## 6 "0210" 50.1 215. hybapr NA
## 7 "0218" 58.5 365. tet2pa NA
## 8 "0613" 140. 267. des2pa NA
## 9 "0704" 158. 86.4 crotbi NA
## 10 "" NA NA alsebl NA
## # … with 394,648 more rows
Here, the column quadrat labels the 20 m \(\times\) 20 m square section of the 50-ha plot in which the tree was found, gx and gy give the x and y spatial coordinates of the tree (measured in meters from the bottom left corner of the plot), sp is a 6-letter code giving the species identity, and dbh stands for diameter at breast height (i.e. the diameter of the main trunk measured 4.5 ft, or 1.37 m, from the ground).
Notice the NAs in the dbh of some trees. This is because this dataset also lists trees that were dead by the time this census was taken in 2010, but were included for compatibility with the previous censuses. We will exclude dead trees from our analysis. In fact, we will focus on trees meeting a size threshold of dbh \(\geq\) 100 mm, which are generally considered to have recruited past the sapling stage. We achieve this by filtering the data set to those trees whose dbh is at least 100 mm:
bci =
bci |>
filter(dbh >= 100)
bci
## # A tibble: 20,802 × 5
## quadrat gx gy sp dbh
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 4924 994. 488. gustsu 308
## 2 4924 990. 489. virosu 358
## 3 4924 994. 498. quaras 318
## 4 4923 993. 469. protte 479
## 5 4923 982. 474. brosal 1290
## 6 4922 983 453. quaras 500
## 7 4921 992. 430. anacex 1660
## 8 4921 992. 432. tri2tu 398
## 9 4921 998. 434 poular 400
## 10 4921 999. 430. beilpe 602
## # … with 20,792 more rows
Notice that this trims down the data to about 21 thousand trees.
**Note: Make sure to apply this filter in your answer worksheet!**
Q1: What is a possible problem with the approach of using a single dbh cutoff for all species as indication that a tree has recruited past the sapling stage? What alternative may be used instead?
We can plot our data to examine the spatial distribution of our trees. In the plot below, each dot is a tree, colored by its species identity.
plot_bci =
bci |>
ggplot(aes(gx, gy, group = sp, color = sp)) +
theme(
legend.position = 'none',
aspect.ratio = 0.5
)
plot_bci +
geom_point(size = .5)
Let’s zoom in on a 200m \(\times\) 100m patch at the center of the forest plot for a more detailed visual:
plot_bci +
geom_point() +
coord_cartesian(xlim = c(400,600), ylim = c(200, 300))
The question is, are those trees randomly distributed on the plot, or is there a non-random pattern to their spatial distribution?
Let’s first ignore the species label, and treat all trees the same. We can then imagine each of the ~21k trees as a point on a plane, and we want to know whether the points are randomly scattered.
The simplest model of spatial randomness is to assume an initially empty plot to which we add N trees, one at a time, with each tree placed independently of all the others and with every location in the plot equally likely.
These simple assembly rules correspond mathematically to drawing a set of N independent gx and gy coordinates, such that for each tree any value gx \(\in\) [0, 1000] and gy \(\in\) [0, 500] is equally likely to be drawn.
As we have seen in lecture, if we then divide our space into many smaller sections (quadrats) and count the number of trees in each section, we expect those counts to follow a Poisson distribution.
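To see this expectation in action, here is a small stand-alone simulation of the null model (illustrative code, not part of the lab pipeline): we scatter roughly the same number of points uniformly over a 1000 m \(\times\) 500 m rectangle, count points inside 20 m \(\times\) 20 m quadrats, and check that the variance of the counts is close to their mean, as expected for a Poisson distribution.

```r
## Simulate the null model: N trees placed uniformly at random
set.seed(1)
N = 20802 ## roughly the number of trees in the filtered BCI data
gx = runif(N, 0, 1000)
gy = runif(N, 0, 500)

## Assign each point to one of the 50 x 25 = 1250 quadrats of 20 m x 20 m
quadrat_id = floor(gx / 20) + 50 * floor(gy / 20)
counts = tabulate(quadrat_id + 1, nbins = 1250)

mean(counts) ## N / 1250, i.e. about 16.6 trees per quadrat
var(counts)  ## close to the mean, as expected for Poisson counts
```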
The plot below shows 20m \(\times\) 20m quadrats overlaid on our 200m \(\times\) 100m central patch:
plot_bci +
geom_point() +
theme(panel.grid.major = element_line(color = 'grey')) +
scale_x_continuous(breaks = seq(400, 600, 20)) +
scale_y_continuous(breaks = seq(200, 300, 20)) +
coord_cartesian(xlim = c(400,600), ylim = c(200, 300))
tabulation =
bci |>
group_by(quadrat) |>
count() |>
ungroup()
tabulation
## # A tibble: 1,250 × 2
## quadrat n
## <chr> <int>
## 1 0000 12
## 2 0001 19
## 3 0002 15
## 4 0003 16
## 5 0004 19
## 6 0005 21
## 7 0006 10
## 8 0007 17
## 9 0008 19
## 10 0009 18
## # … with 1,240 more rows
If indeed tree counts inside the quadrats \(n = 12, 19, 15, 16, ...\) are Poisson-distributed, then the probability that a quadrat contains k trees should be
\[P(n = k) = e^{-\lambda} \frac{\lambda ^ k}{k!}\],
where \(\lambda\) is the mean tree count across all quadrats.
We can easily find the value of \(\lambda\) by running
lambda = mean(tabulation$n)
lambda
## [1] 16.6416
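As a quick sanity check of the formula above, we can compute \(P(n = k)\) by hand for a few values of \(k\) and confirm that it matches R’s built-in dpois() (using the value of \(\lambda\) printed above):

```r
lambda = 16.6416 ## mean tree count per quadrat, computed above
k = c(10, 16, 25)
manual = exp(-lambda) * lambda^k / factorial(k) ## e^-lambda * lambda^k / k!
builtin = dpois(k, lambda = lambda)
all.equal(manual, builtin) ## TRUE: the two calculations agree
```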
Q2. Since the BCI plot is 1000 m \(\times\) 500 m and the quadrats are 20 m \(\times\) 20 m, this means there are 1,250 quadrats in the plot. Based on this and the mean number of trees per quadrat above, how many trees are there in the plot?
Q3: Recall that each row of the tibble bci lists
one tree in the plot. So we can find out how many trees there are on the
BCI plot directly – i.e. without the quadrat middleman – by simply
typing nrow(bci), or equivalently, piping the verb
nrow to the dataset bci. Show code that does
the latter. Be careful not to assign the result to the object
bci, as that would turn it into a number!
Q4. Recall that in a Poisson distribution, the variance
equals the mean. Therefore if the tree counts in the quadrats are indeed
sampled from a Poisson distribution, their variance should be similar to
their mean. Calculate the variance of tabulation$n and
check whether that is the case. Is it close to the mean?
A more rigorous test of whether the data are Poisson-distributed is to fit a Poisson distribution to it and check the goodness of fit. A Poisson distribution has only one free parameter, namely the mean, \(\lambda\). We have already calculated our best estimate for the parameter as the sample mean above, \(\lambda \simeq 16.6\). So now we can compare the empirical distribution of the tree counts against predictions from the fitted Poisson distribution.
data_hist = hist(tabulation$n, plot = FALSE, breaks = 15)
observed =
tibble(
n = data_hist$mids,
density = data_hist$density
)
expected =
tibble(
n = (min(tabulation$n) - 1):(max(tabulation$n) + 1),
density = dpois(n, lambda = lambda)
)
plot_histogram =
ggplot() +
geom_col(data = observed, aes(n, density), color = 'black', fill = 'grey50') +
geom_line(data = expected, aes(n, density), linewidth = 1.5, color = rgb(153/255, 0, 0)) +
theme(aspect.ratio = .5) +
labs(x = 'Tree count', y = 'Frequency') +
ggtitle('Histogram of tree counts across quadrats')
plot_histogram
This looks like a pretty good fit! The frequency of quadrats with different tree counts (grey bars) matches the Poisson expectations (red curve) pretty well.
Q5: Rerun the code above, this time changing the
breaks argument in function hist on the first
line to 35. This will remake the histogram using thinner bins. The
Poisson fit still looks pretty decent, but a small bias in the Poisson
predictions arguably becomes more visible: namely, the Poisson fit
slightly overpredicts crowded quadrats and underpredicts sparse
quadrats. Explain in your own words how the bias is visible from the
plot.
A better way to visually compare observed data against a fitted probability distribution is to look at the empirical cumulative distribution function (ecdf) of the data and compare it to the predicted cumulative distribution function, \(F(k) \equiv P(n \leq k)\), which reads: “\(F(k)\) is defined as the probability that a randomly selected quadrat contains up to \(k\) trees”. We can do that with the following code:
expected_cumulative =
tibble(
n = (min(tabulation$n) - 1):(max(tabulation$n) + 1),
y = ppois(n, lambda = lambda)
)
plot_cumulative =
ggplot() +
stat_ecdf(data = tabulation, aes(n), geom = 'point') +
geom_line(data = expected_cumulative, aes(n, y), color = rgb(153/255, 0, 0)) +
theme(aspect.ratio = .5) +
labs(x = 'Tree count', y = 'Cumulative proportion')
plot_cumulative
In this plot, the y axis shows the proportion of quadrats with tree count up to the amount shown on the x axis.
Q6: Explain why the empirical distribution is always monotonic (i.e. non-decreasing).
Note: Unlike the histogram, the ecdf plot does not require binning the data. This is an advantage of the ecdf, as the shape of the histogram will depend on the number (or width) of bins that we choose to use. Another reason the ecdf is a better visual comparison is that if the data is sparse or the bins are too narrow, some bins may be empty, creating illusory “valleys” in the shape of the histogram.
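To see what stat_ecdf() computes under the hood, consider a toy vector of quadrat counts (illustrative numbers, not BCI data): the ecdf evaluated at \(k\) is simply the proportion of observations less than or equal to \(k\).

```r
x = c(12, 19, 15, 16) ## a toy sample of quadrat counts
Fhat = ecdf(x)        ## base-R empirical cumulative distribution function
Fhat(15)              ## 0.5 : two of the four counts are <= 15
Fhat(16)              ## 0.75: three of the four counts are <= 16
mean(x <= 16)         ## the same proportion computed by hand
```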
Back to BCI, the Poisson fit looks very good! It seems that BCI trees in general are indeed distributed randomly across the 50-hectare plot. As we discussed, this is precisely what we would expect if those trees were randomly located around the plot, with no influence from external factors such as local environmental conditions (soil nutrient concentrations, drainage, topography, etc.) or from interactions between the trees.
Is this conclusive proof that recruitment of trees in BCI is a fully random process?
To answer that, let’s now look at an individual species rather than the full aggregate. We will focus our analysis on Gustavia superba, a small understory tree also known as the heaven lotus tree.
Figure: Gustavia superba, a small understory tree. Photo by the National Gardening Association. Flower, by The Garden of Eden
The code below shows the spatial distribution of heaven lotus trees
by filtering our entire data set bci for only those trees
whose species code is gustsu:
plot_gustsu =
bci |>
filter(sp == 'gustsu') |>
ggplot(aes(gx, gy, color = sp)) +
theme(
legend.position = 'none',
aspect.ratio = 0.5
) +
ggtitle('Gustavia superba')
plot_gustsu +
geom_point()
The spatial distribution doesn’t look random.
We can now repeat the quadrat analysis above for this species. Let’s plot the 20 m \(\times\) 20 m quadrats:
plot_gustsu +
geom_point(size = .5) +
theme(panel.grid.major = element_line(color = 'grey')) +
scale_x_continuous(breaks = seq(0, 1000, 20)) +
scale_y_continuous(breaks = seq(0, 500, 20)) +
theme(axis.text = element_blank())
We run into a problem: given the reduced tree count now that we are looking at a single species, the vast majority of quadrats are empty, and most of the others contain only 1 or 2 trees. In order for our quadrat analysis to make sense, we need to use bigger quadrats than 20 m \(\times\) 20 m. Let us instead use 100 m \(\times\) 100 m quadrats.
plot_gustsu +
geom_point() +
theme(panel.grid.major = element_line(color = 'grey')) +
scale_x_continuous(breaks = seq(0, 1000, 100)) +
scale_y_continuous(breaks = seq(0, 500, 100)) +
theme(axis.text = element_blank())
Note: In general, when we try to fit data to a statistical distribution, we want to have a large number of data points and some variation across the data. In quadrat analysis, these considerations will weigh into our choice of quadrat size. The chosen size of 100 m \(\times\) 100 m strikes a good balance in our case.
The code below tabulates the counts of Gustavia superba across 100m \(\times\) 100m quadrats.
quadrat_finder = function(gx, gy, Lx, Ly){
nx = 1000 / Lx ## number of quadrat columns across the plot
ny = 500 / Ly ## number of quadrat rows across the plot
qx = findInterval(gx, Lx * (0:nx)) ## column index: 1 for the leftmost column
qy = findInterval(gy, Ly * (1:ny)) ## row index: 0 for the bottom row
quadrat = qx + nx * qy ## unique quadrat label, from 1 to nx * ny
}
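As a quick sanity check of the logic above, we can evaluate the function on a couple of hand-picked coordinates whose quadrat labels are easy to work out by eye (the function is repeated here so the chunk runs on its own):

```r
quadrat_finder = function(gx, gy, Lx, Ly){
nx = 1000 / Lx
ny = 500 / Ly
qx = findInterval(gx, Lx * (0:nx))
qy = findInterval(gy, Ly * (1:ny))
quadrat = qx + nx * qy
}

q1 = quadrat_finder(50, 50, Lx = 100, Ly = 100)   ## bottom-left 100 m quadrat
q2 = quadrat_finder(950, 450, Lx = 100, Ly = 100) ## top-right 100 m quadrat
c(q1, q2) ## 1 and 50, matching the 50 quadrats tabulated below
```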
tabulation_gustsu =
bci |>
mutate(quadrat = quadrat_finder(gx, gy, Lx = 100, Ly = 100)) |>
group_by(quadrat) |>
summarize(n = sum(sp == 'gustsu'), .groups = 'drop')
tabulation_gustsu
## # A tibble: 50 × 2
## quadrat n
## <dbl> <int>
## 1 1 8
## 2 2 2
## 3 3 3
## 4 4 10
## 5 5 5
## 6 6 7
## 7 7 3
## 8 8 1
## 9 9 5
## 10 10 11
## # … with 40 more rows
Q7. a) What is the mean count of heaven lotus trees across
all quadrats? b) Assign the result from (a) to an object named
lambda_gustsu. Show your code. c) Using the full data set,
calculate the percentage of the trees in the BCI plot that are heaven
lotus trees.
Q8. Based on the mean-vs-variance test, does it seem that the spatial distribution of Gustavia superba on BCI is well described by a Poisson process?
Q9. In the birds-on-a-line example from lecture, we found that the variance in bird counts across segments was much lower than the mean, and concluded that the birds were evenly spaced. How do your results for the heaven lotus tree compare, and what do you conclude? Does your conclusion match what you would guesstimate by visual inspection of the Gustavia superba plot above?
The code below plots the cumulative empirical distribution function for G. superba.
expected_cumulative_gustsu =
tibble(
n = (min(tabulation_gustsu$n) - 1):(max(tabulation_gustsu$n) + 1),
y = ppois(n, lambda = lambda_gustsu)
)
plot_cumulative_gustsu =
ggplot() +
stat_ecdf(data = tabulation_gustsu, aes(n), geom = 'point') +
geom_line(data = expected_cumulative_gustsu, aes(n, y), color = rgb(153/255, 0, 0)) +
theme(aspect.ratio = .5) +
labs(x = 'Tree count', y = 'Cumulative proportion') +
ggtitle('Gustavia superba')
plot_cumulative_gustsu
The plot reveals two clear outliers: two quadrats with an enormous number of G. superba trees compared to all other quadrats. (Looking back at the spatial plot, it is not difficult to find those quadrats.)
Sometimes, researchers wish to examine the behavior of the bulk of the data, which they do by setting outliers aside (which is not to say outliers can be ignored!). In our case, one could hypothesize that these outlier quadrats are driving the statistical inference. Perhaps if we remove them from the analysis, the other quadrats are a good Poisson fit?
The code below removes those two heavily populated quadrats, and shows a histogram of the tree counts inside the other quadrats, overlaid with the Poisson fit.
filtered_gustsu =
tabulation_gustsu |>
filter(n < 50)
lambda_filtered_gustsu = mean(filtered_gustsu$n)
data_hist = hist(filtered_gustsu$n, plot = FALSE, breaks = 15)
observed =
tibble(
n = data_hist$mids,
density = data_hist$density
)
expected =
tibble(
n = (min(filtered_gustsu$n) - 1):(max(filtered_gustsu$n) + 1),
density = dpois(n, lambda = lambda_filtered_gustsu)
)
plot_histogram =
ggplot() +
geom_col(data = observed, aes(n, density), color = 'black', fill = 'grey50') +
geom_line(data = expected, aes(n, density), linewidth = 1.5, color = rgb(153/255, 0, 0)) +
theme(aspect.ratio = .5) +
labs(x = 'Tree count', y = 'Frequency') +
ggtitle('Histogram of tree counts across quadrats')
plot_histogram
Compared to the Poisson curve, we see that both very low- and very high-abundance quadrats are more common than expected, and intermediate-abundance quadrats are rarer than expected. (In fact, the data seem bimodal: there are two local peaks rather than the single peak of the Poisson curve.)
Q10. How do you interpret this in terms of whether the trees are spatially clumped or evenly dispersed?
Q11. Do you conclude that the outlier quadrats were or were not driving our statistical inference?
A different method of spatial analysis, which does not require quadrats, is based on examining the neighborhood of each data point (each tree in our case) at different scales and looking for departures from the Poisson process.
Suppose we drop a pin on a random tree on BCI, and draw a circle of radius \(r\) around it. In a Poisson process, the expected number of trees to be found inside that circle is proportional to the circle’s area, \(\pi r^2\). If we observe that trees on BCI tend to have many more neighbors than this null expectation, that would suggest spatial clustering at scale \(r\). Conversely, finding many fewer neighbors than expected suggests spatial overdispersion at that scale. As we sweep across different scales, we can look for scales where BCI departs from the Poisson process.
This is the idea behind Ripley’s K analysis. For a given data set, we can empirically estimate Ripley’s K function \(K_{Emp}(r)\) by tallying the number of neighbors of a focal tree inside a circle of radius \(r\) centered on that tree, averaging across all focal trees, and dividing by the overall density of trees, \(\lambda = N/A\), where \(N\) is the number of data points and \(A\) is the area of the plot. For a Poisson process, the theoretical value of Ripley’s K is known, namely \(K_{Pois}(r) = \pi r^2\). So for our analysis, we can plot the ratio \(K_{Emp} / K_{Pois}\) and look for departures from the expected ratio of 1.
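To build intuition for what the spatstat functions will compute below, here is a naive version of the estimator applied to simulated random points (a rough sketch with no edge correction, so the ratio comes out slightly below 1 because trees near the plot boundary have part of their circle outside the plot):

```r
## Naive Ripley's K on simulated uniform points (no edge correction)
set.seed(42)
N = 2000
A = 1000 * 500 ## plot area in square meters
r = 10         ## neighborhood radius, in meters
gx = runif(N, 0, 1000)
gy = runif(N, 0, 500)

d = dist(cbind(gx, gy))        ## all pairwise distances
neighbors = 2 * sum(d < r) / N ## average number of neighbors within r of a point
K_emp = neighbors / (N / A)    ## divide by density to get Ripley's K
K_emp / (pi * r^2)             ## close to 1 for a random pattern
```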
The code below writes functions to perform our Ripley’s K analysis (you don’t need to understand the code).
BCI_ppp = function(species){
if(species == 'all'){
dtf = bci
} else{
dtf =
bci |>
filter(sp == species)
}
dtf |>
select(gx, gy) |>
unique() |>
as.ppp(W = list(xrange = c(0, 1000), yrange = c(0, 500)))
}
K_tibble = function(species){
x = BCI_ppp(species)
Kinhom(x, correction = 'isotropic') |>
as_tibble() |>
mutate(sp = species)
}
Now we call our function K_tibble() passing
all as the species argument in order to
perform the analysis for all BCI species in aggregate. We then plot the
result.
data_allspp = K_tibble(species = 'all')
plot_K =
data_allspp |>
filter(r > .75) |>
ggplot() +
geom_hline(yintercept = 1, color = rgb(153/255, 0, 0)) +
geom_line(aes(r, iso / pi / r^2), linewidth = 1) +
theme(aspect.ratio = 1) +
ylab('Observed K / Expected K') +
xlab('circle radius')
plot_K +
scale_x_log10()
As we can see, the empirical Ripley’s K is very close to the null expectation at distances larger than 10 m from a tree. This suggests that indeed at distance scales beyond 10 meters, BCI is indistinguishable from a Poisson process, meaning that the spatial distribution of trees is indistinguishable from random at those scales.
Q12. Is this result consistent with the mean / variance test we performed above when gridding the plot into 20m \(\times\) 20m quadrats? Explain your answer.
Below this distance threshold, however, we have indication of overdispersion, i.e. the immediate neighborhood of existing trees is sparse. In other words, BCI trees are surrounded by fewer close neighbors than expected by chance.
Q13. Thinking about light and nutrients, provide an ecological hypothesis for tree-tree interactions that is consistent with this result.
Whatever the reason, these results suggest that tree-tree interactions on BCI are likely contained within 10 meters of each other.
Let’s rerun the Ripley’s K analysis, this time focusing on the heaven lotus tree.
data_gustsu = K_tibble(species = 'gustsu')
plot_K =
data_gustsu |>
filter(r > .75) |>
ggplot() +
geom_hline(yintercept = 1, color = rgb(153/255, 0, 0)) +
geom_line(aes(r, iso / pi / r^2), linewidth = 1) +
theme(aspect.ratio = 1) +
ylab('Observed K / Expected K') +
xlab('circle radius')
plot_K +
scale_x_log10()
Now we get the opposite result!
At very short distances (\(r < 2.5\) m) the empirical Ripley’s K is much higher than expected, suggesting strong clumping at these scales. This likely reflects the large clump of trees observed between gx = 600-750 m and gy = 400-500 m. However, we see an overabundance of heaven lotus trees neighboring each other even beyond 2.5 m apart, with the clumping effect decreasing at larger distances.
This is consistent with our previous quadrat-based results for G. superba, except now we are getting a more detailed distance-specific description of the spatial distribution of this species.
Figure: Fruit and seed of the heaven lotus tree. Photos by Spencer Woodard (fruit) and Roger Culos (seed).
Q14: Provide a biological hypothesis for the clumping of G. superba.
Q15. Repeat the analysis above for another species, Trichilia tuberculata (species code “tri2tu”), a large, common canopy tree on BCI. Comparing results for this species with those of Gustavia superba and the analysis for all species, what differences and similarities do you see?
Figure: Flower, leaves, and fruit of Trichilia tuberculata. Photos by Riley Fortier (flower) and Andrés Hernández, STRI.
While our empirical Ripley’s K departs from expectations at various scales, we need a more sophisticated analysis in order to be confident that these results could not come about by chance. This is because even a true Poisson process may deviate from expectations by chance, just like a fair coin may not turn up exactly 5 heads and 5 tails out of 10 tosses. If we got 6 heads and 4 tails, how confident should we be that the coin is biased towards heads? What if we got 7 heads and 3 tails?
In order to verify that we are seeing convincing departures from random expectations, we must assess how often the random process could generate departures at least as large as the ones we saw. One way to do that is by simulating the random process a large number of times, and then check what proportion of those simulations show even greater departures from expectations than our data. If that percentage is very small, then we can confidently reject the possibility that the apparent non-randomness in our data is a fluke — i.e. we can then confidently reject the null model. In that case, we say the result is statistically significant. This is the idea behind p-value tests, a staple of statistical inference.
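The coin example makes this concrete. Under the null model of a fair coin, we can compute exactly how often chance alone produces a result at least as extreme as 7 heads in 10 tosses, and check the answer against a simulation of the null model (a small illustration, not lab code):

```r
## Exact probability of 7 or more heads in 10 tosses of a fair coin
p_exact = sum(dbinom(7:10, size = 10, prob = 0.5))
p_exact ## 0.171875 -- far too common under the null model to call the coin biased

## The same probability estimated by simulating the null model many times
set.seed(1)
heads = rbinom(10000, size = 10, prob = 0.5) ## 10,000 simulated sets of 10 tosses
mean(heads >= 7) ## close to the exact value above
```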
The code below performs 999 simulations of a Poisson process (our null model), so that together with the observed data there are 1,000 point patterns in total. Based on these simulations, it determines, for each distance, what interval of Ripley’s K values above and below the expectations could have arisen by chance. It then plots that interval as a pink ribbon. Values outside the ribbon have a less than 5% chance of arising from the null model (i.e. by chance), and are thus deemed statistically significant.
Note: The last sentence above is only approximately right. It would be exactly true if we used infinitely many replicates of the random process to obtain the ribbon. Of course we can’t do that, so the plotted interval obtained with a finite sample of random processes is an estimate of the theoretical interval that would arise from an infinite set.
focal_species_name = 'Gustavia superba'
focal_species_code = 'gustsu'
focal_pp = BCI_ppp(focal_species_code)
null_tibble =
envelope(
focal_pp,
Kinhom,
nsim = 999,
nrank = 25,
verbose = FALSE
)
plot_RipleysK_with_quantiles =
ggplot() +
geom_hline(yintercept = 1, color = rgb(153 / 255, 0, 0)) +
geom_ribbon(
data =
null_tibble |>
filter(r > 0.75),
aes(x = r, ymin = lo / pi / r^2, ymax = hi / pi / r^2),
fill = 'red',
alpha = 0.3
) +
geom_line(
data =
null_tibble |>
filter(r > .75),
aes(r, obs / pi / r^2),
linewidth = 1
) +
ylab('Observed K / Expected K') +
xlab('circle radius') +
theme(aspect.ratio = 1) +
ggtitle(focal_species_name) +
scale_x_log10()
plot_RipleysK_with_quantiles +
geom_vline(
xintercept = c(1.25, 5),
color = c('darkgreen', 'blue'),
linewidth = 2
)
We can see that the clustering within tree-tree distances \(r < 1.25\) m (green line) is statistically significant, far exceeding the upper bound of our null expectations. Furthermore, the clustering at distances \(>\) 5 m (blue line) is also significant. But now we see that the spatial distribution of Gustavia superba trees at ranges between \(r =\) 1.25 m and 5 m is consistent with the null model — in other words, we cannot reject with greater than 95% confidence the possibility that it could have arisen by chance.
Q16. Repeat the analysis above for Trichilia
tuberculata by tweaking the code above in appropriate places (Also,
delete the green and blue lines by deleting the
geom_vline() verb at the bottom of the code, along with the
plus sign at the end of the line above it). Do any of your conclusions
in Q15 need revising?
Optional bonus point: If you wish, repeat the analysis for
all species on BCI. This will take a lot longer to run, because the
numerical simulations of the random process get slower for the much
higher number of points required, which is now
number_of_points = nrow(bci). You will also need to set
focal_species_name = "All species" and
focal_species_code = "all". Do not attempt this before
finishing the rest of the lab because the code takes a long time to
run!